This script describes the features of the data, using different methods and graphs. The data description is based on three datasets:
The joined dataset which only has information from 2024
The joined dataset which contains information from 2015 to 2024
The WHO TB burden estimates [>1Mb] dataset, as it contains information from previous years
Overall, the document aims to do some lightweight analysis and data exploration prior to the heavier analysis in the two subsequent analysis files.
Loading relevant libraries:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
# Access to the functions for loading and saving the datasets
source("99_proj_func.R")
Loading data:
Joined dataset, year 2024:
# Loading age, sex and risk group data (2024 only) - joined and augmented version of the data:
data_file <- "03_aug_TB_age_sex.tsv"
TB_age_sex_joined <- load_data(data_file)
Loading ../data/03_aug_TB_age_sex.tsv from local file…
# Display the data:
slice_sample(TB_age_sex_joined, n = 5)
# A tibble: 5 × 18
country year age_group sex risk_factor TB_cases_best TB_cases_min
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 Iraq 2024 25-34 Male no risk fa… 1300 290
2 Ireland 2024 35-44 Both no risk fa… 69 36
3 Republic of Mold… 2024 20-24 Fema… no risk fa… 19 8
4 Kazakhstan 2024 20-24 Male no risk fa… 240 84
5 Haiti 2024 15+ Fema… no risk fa… 8400 5800
# ℹ 11 more variables: TB_cases_max <dbl>, population_size <dbl>,
# total_TB_cases_best <dbl>, total_TB_cases_min <dbl>,
# total_TB_cases_max <dbl>, TB_cases_pr_100k_best <dbl>,
# TB_cases_pr_100k_min <dbl>, TB_cases_pr_100k_max <dbl>,
# total_TB_cases_pr_100k_best <dbl>, total_TB_cases_pr_100k_min <dbl>,
# total_TB_cases_pr_100k_max <dbl>
Joined dataset, year 2015 to 2024:
# Loading the 2015 to 2024 data - joined version of the data:
data_file <- "03_aug_TB_10_years.tsv"
TB_10_years_joined <- load_data(data_file)
Loading ../data/03_aug_TB_10_years.tsv from local file…
# Display the data:
slice_sample(TB_10_years_joined, n = 5)
# Loading the data dictionary:
data_file <- "01_load_dictionary.tsv"
TB_dictionary <- load_data(data_file)
Loading ../data/01_load_dictionary.tsv from local file…
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
slice_sample(TB_dictionary, n=5)
# A tibble: 5 × 4
variable_name dataset code_list definition
<chr> <chr> <chr> <chr>
1 new_sp Notification <NA> New pulmonary smear-posit…
2 miners_screen Policies and services <NA> (if miners_screen_data_av…
3 dst_ptd Laboratories <NA> Number of sites providing…
4 newrel_m5564 Notification <NA> New and relapse cases (bu…
5 e_inc_tbhiv_100k_hi Estimates <NA> Estimated incidence of TB…
Data description:
TB_age_sex_joined (2024) - Description:
This dataset contains the number of TB cases across different countries, categorized by age group and gender. It also includes cases per 100,000 population, enabling standardized and comparable analysis between countries.
slice_sample(TB_age_sex_joined, n=5)
# A tibble: 5 × 18
country year age_group sex risk_factor TB_cases_best TB_cases_min
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 Madagascar 2024 5-14 Fema… no risk fa… 2200 0
2 United Republic … 2024 25-34 Fema… no risk fa… 7300 0
3 Zimbabwe 2024 0-14 Both no risk fa… 6100 2500
4 Colombia 2024 10-14 Male no risk fa… 160 55
5 French Polynesia 2024 20-24 Both no risk fa… 6 2
# ℹ 11 more variables: TB_cases_max <dbl>, population_size <dbl>,
# total_TB_cases_best <dbl>, total_TB_cases_min <dbl>,
# total_TB_cases_max <dbl>, TB_cases_pr_100k_best <dbl>,
# TB_cases_pr_100k_min <dbl>, TB_cases_pr_100k_max <dbl>,
# total_TB_cases_pr_100k_best <dbl>, total_TB_cases_pr_100k_min <dbl>,
# total_TB_cases_pr_100k_max <dbl>
We can briefly explore how much the picture changes when countries are compared by their total TB cases versus their TB cases per 100k citizens.
Scatter plot of the 10 countries with the most TB cases in total:
# Getting the 10 countries with the highest total number of TB cases:
top_10_countries <- TB_age_sex_joined |>
  group_by(country) |>
  summarise(total_TB = first(total_TB_cases_best)) |>
  arrange(desc(total_TB)) |>
  slice_head(n = 10) |>
  pull(country)

# Making a tibble for those countries, and plotting them:
plot <- TB_age_sex_joined |>
  filter(country %in% top_10_countries) |>
  group_by(country) |>
  summarise(
    mean_best = sum(TB_cases_best, na.rm = TRUE), # Sum of TB cases for this country
    min_val = sum(TB_cases_min, na.rm = TRUE),
    max_val = sum(TB_cases_max, na.rm = TRUE)
  ) |>
  ggplot(aes(x = mean_best, y = fct_reorder(country, mean_best))) +
  # fct_reorder(country, mean_best) orders the countries by their mean_best value,
  # so the country with the highest value is drawn at the top of the y-axis.
  geom_point(size = 3, color = "orange") +
  geom_errorbarh(aes(xmin = min_val, xmax = max_val), height = 0.2) +
  labs(
    x = "TB cases\n(Best estimate with min/max)",
    y = "Country",
    title = "Top 10 Countries - Total TB cases, with error bars"
  ) +
  theme_minimal()

# Save it
ggsave(
  filename = "../results/04_1_top10_TB.png", # Choose the folder + filename
  plot = plot,
  width = 8, # inches
  height = 5, # inches
  dpi = 300 # high quality
)

plot
Note that very large countries such as China and India are part of this graph.
This is not to say that TB is not a problem in these countries, but to highlight that the TB intensity relative to population size might be lower than the totals suggest.
This will make sense once you glance at the following plot.
Scatter plot of the 10 countries with the most TB cases per 100k citizens (standardized):
# Storing the 10 countries with the most TB cases pr. 100k citizens:
top_10_countries_100k <- TB_age_sex_joined |>
  group_by(country) |>
  summarise(
    # Note: The value is constant for each country, so we just use first()
    # to pick the first value (we want to reduce several rows
    # to 1 row pr. country)
    total_TB_cases_pr_100k_best = first(total_TB_cases_pr_100k_best)
  ) |>
  arrange(desc(total_TB_cases_pr_100k_best)) |>
  slice_head(n = 10) |>
  pull(country)

# Making a tibble for those countries, and plotting them:
plot <- TB_age_sex_joined |>
  filter(country %in% top_10_countries_100k) |>
  group_by(country) |>
  summarise(
    mean_best = first(total_TB_cases_pr_100k_best),
    min_val = first(total_TB_cases_pr_100k_min),
    max_val = first(total_TB_cases_pr_100k_max)
  ) |>
  ggplot(aes(x = mean_best, y = fct_reorder(country, mean_best))) +
  # fct_reorder(country, mean_best) orders the countries by their mean_best value,
  # so the country with the highest value is drawn at the top of the y-axis.
  geom_point(size = 3, color = "orange") +
  geom_errorbarh(aes(xmin = min_val, xmax = max_val), height = 0.2) +
  labs(
    x = "TB cases pr. 100k\n(Best estimate with min/max)",
    y = "Country",
    title = "Top 10 Countries - TB cases pr. 100k citizens, with error bars"
  ) +
  theme_minimal()

# Save it
ggsave(
  filename = "../results/04_2_top10_TB_100k.png", # Choose the folder + filename
  plot = plot,
  width = 8, # inches
  height = 5, # inches
  dpi = 300 # high quality
)
plot
Would you look at that!
Except for the Philippines, none of these countries were part of the plot for the “top 10 total TB cases countries”.
It goes to show that the standardized number of TB cases per 100k citizens might be a better measure of a country's TB disease intensity.
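To make the effect of standardization concrete, here is a minimal sketch on made-up numbers. The tibble `toy`, its three countries, and the formula `cases / population * 1e5` are assumptions for illustration only; the report does not show how the `*_pr_100k_*` columns were actually computed.

```r
library(dplyr)

# Hypothetical toy numbers, NOT taken from the WHO data
toy <- tibble(
  country    = c("A", "B", "C"),
  cases      = c(500000, 20000, 3000),
  population = c(1e9, 5e6, 4e5)
)

# Assumed standardization: cases per 100,000 inhabitants
toy_rates <- toy |>
  mutate(cases_pr_100k = cases / population * 1e5)

# Ranking by total cases vs. ranking by cases per 100k
rank_total <- toy_rates |> arrange(desc(cases)) |> pull(country)
rank_rate  <- toy_rates |> arrange(desc(cases_pr_100k)) |> pull(country)

rank_total
rank_rate
```

In this toy example the country with the most cases in total ends up last once the counts are divided by population size, which is the same kind of reordering seen between the two plots.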
TB_10_years_joined (2015-2024) - Description:
We combined three WHO datasets — TB burden estimates, MDR/RR-TB burden estimates, and TB infection in household contacts — into a single multi-country, multi-year panel covering approximately 10 years.
The merged dataset contains measures of TB incidence and mortality, MDR/RR-TB incidence, and estimated household infection rates for each country–year.
This integrated dataset allows us to describe global TB trends, compare drug-resistant and drug-sensitive TB, and evaluate household transmission indicators.
(Note: Multidrug-resistant tuberculosis (MDR-TB) is defined as disease due to Mycobacterium tuberculosis that is resistant to isoniazid (H) and rifampicin (R), with or without resistance to other drugs. RR-TB stands for rifampicin-resistant TB.)
# A tibble: 43 × 2
variable_name definition
<chr> <chr>
1 country Country or territory name
2 rr_new Number of new bacteriologically confirmed pulmonary TB patient…
3 c_newinc_100k Case notification rate, which is the total of new and relapse …
4 cfr Estimated TB case fatality ratio
5 cfr_hi Estimated TB case fatality ratio: high bound
6 cfr_lo Estimated TB case fatality ratio: low bound
7 cfr_pct Estimated TB case fatality ratio expressed as a percentage
8 cfr_pct_hi Estimated TB case fatality ratio: high bound expressed as a pe…
9 cfr_pct_lo Estimated TB case fatality ratio: low bound expressed as a per…
10 e_inc_100k Estimated incidence (all forms) per 100 000 population
# ℹ 33 more rows
After considering the definitions, we can select only the important variables that hold the most descriptive information to get to know the data. We are selecting data per 100k to get the most comparable summary.
Warning: There was 1 warning in `reframe()`.
ℹ In argument: `across(where(is.numeric), list(mean = mean, sd = sd), na.rm =
TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
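The `across()` deprecation warning above can be avoided by passing `na.rm = TRUE` inside anonymous functions, as the warning itself suggests. The code that produced the warning is not shown in this report, so the following is only a sketch on a hypothetical tibble `toy`; the column names `e_inc_100k` and `cfr_pct` are borrowed from the data dictionary, but the values are made up.

```r
library(dplyr)

# Hypothetical toy data, NOT taken from TB_10_years_joined
toy <- tibble(
  year       = c(2015, 2015, 2016, 2016),
  e_inc_100k = c(100, NA, 80, 120),
  cfr_pct    = c(10, 20, NA, 30)
)

# Mean and standard deviation per year, with na.rm = TRUE supplied
# through anonymous functions instead of the deprecated `...` argument
toy_summary <- toy |>
  group_by(year) |>
  summarise(
    across(
      where(is.numeric),
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        sd   = \(x) sd(x, na.rm = TRUE)
      )
    ),
    .groups = "drop"
  )

toy_summary
```

The result has one row per year and one `<column>_mean` and `<column>_sd` column for each numeric variable. The second warning is resolved in a similar spirit by using `linewidth` instead of `size` for line geoms.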
Interestingly, household contacts seem to have dropped rapidly during the COVID-19 quarantine (around April 2020), only to increase rapidly afterwards.